In this codelab, there are several Labs (cells) where you need to write code to solve the problems. If you need some hints, you may take a look at the Solution page to see the answers.
In [ ]:
import tensorflow as tf
tf.__version__
This codelab requires TensorFlow 1.0 or above. If you see an older version such as 0.11.0rc0, please follow the instructions below to update your local Datalab.
> docker pull gcr.io/cloud-datalab/datalab:local

Then use the docker run command to restart your local Datalab.

To prepare for the analysis, we will download training data from BigQuery, a fully managed, scalable data warehouse service on Google Cloud. BigQuery provides many kinds of public datasets, which makes it a useful data source for learning data analytics with TensorFlow.
One of these public datasets is the NYPD Motor Vehicle Collisions Data, which records all the car accidents that happened in NYC from 2012 to the present. In this codelab, we will use it to get 10,000 pairs of values from the "borough" column and the "latitude"/"longitude" columns.
Let's take a look at the data by executing a BigQuery SQL query. In Cloud Datalab, you can execute BigQuery commands by using the "%%sql" command (see this doc to learn more about the BigQuery commands). Select the cell below and run the query by clicking "Run" on the menu.
In [ ]:
%%sql -d standard
SELECT
timestamp,
borough,
latitude,
longitude
FROM
`bigquery-public-data.new_york.nypd_mv_collisions`
ORDER BY
timestamp DESC
LIMIT
15
In this codelab, we do not care about the car accidents themselves. We just want to use the data to get pairs of "latitude"/"longitude" values and an "is it in Manhattan or not" label. So, we want to do the following preprocessing on this raw data:

- convert the "borough" column into a label: 1 for MANHATTAN, 0 for the other boroughs
- remove rows with an empty borough name or a missing (or zero) latitude or longitude
- exclude rows in the BRONX borough
- randomize the order of the rows and take the first 10,000

So, our SQL with the preprocessing will look like the following. Select the cell below and run it. Please note that this only defines the SQL module "nyc_collisions" that will be used later; it does not output anything.
In [ ]:
%%sql --module nyc_collisions
SELECT
IF(borough = 'MANHATTAN', 1, 0) AS is_mt,
latitude,
longitude
FROM
`bigquery-public-data.new_york.nypd_mv_collisions`
WHERE
LENGTH(borough) > 0
AND latitude IS NOT NULL AND latitude != 0.0
AND longitude IS NOT NULL AND longitude != 0.0
AND borough != 'BRONX'
ORDER BY
RAND()
LIMIT
10000
Then, we need to execute the SQL defined above on BigQuery and import the data into Datalab. For this purpose, Datalab provides BigQuery APIs that allow you to execute the defined SQL and import the results as a NumPy array named nyc_cols. Run the cell below and confirm that it loaded 10,000 rows.
In [ ]:
import datalab.bigquery as bq
nyc_cols = bq.Query(nyc_collisions).to_dataframe(dialect='standard').as_matrix()
print(nyc_cols)
print("\nLoaded " + str(len(nyc_cols)) + " rows.")
Let's take a look at what's inside the result. Run the cell below and check that the variable is_mt holds an array of 1s and 0s representing whether each geolocation is in Manhattan or not, and that the variable latlng holds an array of latitude/longitude pairs.
In [ ]:
import numpy as np
is_mt = nyc_cols[:,0].astype(np.int32) # read the 0th column (is_mt) as int32
latlng = nyc_cols[:,1:3].astype(np.float32) # read the 1st and 2nd columns (latitude and longitude) as float32
print("Is Manhattan: " + str(is_mt))
print("\nLat/Lng: \n\n" + str(latlng))
(You can skip this lab if you know how to use NumPy)
You might notice that we just used NumPy to extract the results. NumPy is the most popular Python library for numerical calculations. For machine learning with Python, many people use NumPy for a wide variety of numerical operations, including basic array operations such as reshaping, merging, splitting, filtering, slicing and indexing. Many TensorFlow APIs are also influenced by NumPy and use similar concepts. If you want to learn machine learning and TensorFlow with Python, we recommend you learn some NumPy basics as well.
In this lab, let's try a few basic array operations with NumPy. Run the cell below and see what kind of NumPy array is created.
In [ ]:
# create a NumPy array with the numbers from 0 to 14
A = np.arange(15)
print(A)
Now, add the necessary code in the following cells and run them to get the results described in the comments. Refer to the NumPy Quickstart to learn how to get the required results.
In [ ]:
# reshape the array A into an array with shape in 3 rows and 5 columns,
# set it to variable A, and print it.
# *** ADD YOUR CODE HERE ***
print(A)
# expected result:
# [[ 0 1 2 3 4]
# [ 5 6 7 8 9]
# [10 11 12 13 14]]
In [ ]:
# print() the shape, data type name, and size (total number of elements) of the array A
# *** ADD YOUR CODE HERE ***
# expected result:
# (3, 5)
# int64
# 15
In [ ]:
# multiply the array A by the number 2 and print the result
# *** ADD YOUR CODE HERE ***
# expected result:
# [[ 0 2 4 6 8]
# [10 12 14 16 18]
# [20 22 24 26 28]]
In [ ]:
# create a new array that has the same shape as the array A filled with zeros, and print it
# *** ADD YOUR CODE HERE ***
# expected result:
# [[ 0. 0. 0. 0. 0.]
# [ 0. 0. 0. 0. 0.]
# [ 0. 0. 0. 0. 0.]]
In [ ]:
# create a new array that has the elements in the right-most column of the array A
# *** ADD YOUR CODE HERE ***
# expected result:
# [ 4 9 14]
In [ ]:
# collect the elements of array B at the indices where I % 2 == 0, and print them
B = np.arange(10)
I = np.arange(10)
# *** ADD YOUR CODE HERE ***
# expected result:
# [0 2 4 6 8]
Now we've got the training data. However, it's not ready for training a neural network model yet. If you used the raw data directly, the training would fail because the scales of the features (latitude and longitude in this case) are quite different.

In machine learning, it is very common to preprocess the raw data with feature scaling, which normalizes the features so that they all have the same scale. That makes it much easier for machine learning algorithms to compare the features and find relationships between them.
In this codelab, we will use StandardScaler from scikit-learn. Scikit-learn is another popular Python library for machine learning that provides a wide variety of training algorithms as well as preprocessing and validation tools.

The StandardScaler scales each feature so that its mean becomes 0 and its standard deviation becomes 1. This scaling is called standardization. Run the cell below and see how it scales the latitudes and longitudes and stores them in the variable latlng_std.
In [ ]:
from sklearn.preprocessing import StandardScaler
latlng_std = StandardScaler().fit_transform(latlng)
print(latlng_std)
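Under the hood, standardization simply subtracts the per-column mean and divides by the per-column standard deviation. As a quick sanity check (this is not part of the lab), the following NumPy sketch should produce values matching latlng_std up to floating point precision:

import numpy as np
# equivalent of StandardScaler().fit_transform(latlng)
latlng_manual = (latlng - latlng.mean(axis=0)) / latlng.std(axis=0)
print(latlng_manual)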
In [ ]:
# Lab: print the mean and standard deviation of each column of latlng_std
# and confirm they are (close to) 0 and 1 respectively
# *** ADD YOUR CODE HERE ***
Now, all the preprocessing of the training data has been done. Let's see what the data looks like by using Matplotlib, a popular visualization library for Python. In this case we will use the scatter() method to plot dots at the pairs of latitude and longitude. Run the cell below and see the plot.
In [ ]:
import matplotlib.pyplot as plt
lat = latlng_std[:,0]
lng = latlng_std[:,1]
plt.scatter(lng[is_mt == 1], lat[is_mt == 1], c='b') # plot points in Manhattan in blue
plt.scatter(lng[is_mt == 0], lat[is_mt == 0], c='y') # plot points outside Manhattan in yellow
plt.show()
You can see that the geolocations in Manhattan are plotted as blue dots, and the others as yellow dots. Also, the latitudes and longitudes are now scaled to be centered around 0.

Before we start training the neural network model, we need to separate out a part of the training data as test data. The test data will be used for checking the accuracy of the model's classifications after training. This is common practice in machine learning, so that the performance of your model can be evaluated accurately.
Run the cell below and split the data into 8,000 pairs of training data and 2,000 pairs of test data.
In [ ]:
# 8,000 pairs for training
latlng_train = latlng_std[0:8000]
is_mt_train = is_mt[0:8000]
# 2,000 pairs for test
latlng_test = latlng_std[8000:10000]
is_mt_test = is_mt[8000:10000]
print("Split finished.")
Now, let's use TensorFlow.
TensorFlow is an open source library for machine learning. You can define your own neural network or deep learning model and run the training on your laptop, or use many CPUs and GPUs in the cloud for scalable and faster training and prediction.
TensorFlow provides two kinds of APIs:

- a high level API, which lets you build and train common models with only a few lines of code
- a low level API, which lets you define every detail of the computation yourself

If you will use common neural network and machine learning models (such as fully-connected neural networks, convolutional neural networks, logistic regression and k-means), the high level API is recommended. If you want to design your own neural network model with sophisticated or novel algorithms, or if you want to learn the underlying technology used for implementing the high level API, the low level API is the best option.
In this codelab, we will use the high level API first, and then look at the low level API to learn more about the underlying technology.
In [ ]:
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.ERROR) # suppress warning messages
# define two feature columns consisting of real values
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=2)]
# create a neural network
dnnc = tf.contrib.learn.DNNClassifier(
feature_columns=feature_columns,
hidden_units=[],
n_classes=2)
dnnc
The code above does the following:

- sets the TensorFlow log verbosity to ERROR to suppress warning messages
- defines a feature column of two-dimensional real values (the latitude and the longitude)
- creates a DNNClassifier with that feature column, no hidden units (hidden_units=[]) and 2 output classes, i.e. a single neuron
In a nutshell, this code defines a neural network like the following illustration, the same single neuron we tried in the Playground, where we put latitude and longitude as the inputs x1 and x2 respectively.

Just like we saw in the Playground, the neuron can classify each datapoint into two groups by drawing a single straight line. While training this neuron with the training data, it tries to adjust its weight and bias values to find the angle and position of the line that classifies Manhattan best.

So here, we're training a neural network (consisting of a single neuron) to classify whether a geolocation is in Manhattan or not by drawing a single straight line on the map.
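For intuition, here is a minimal NumPy sketch (not part of the lab; the weights and bias are made up for illustration) of what a single neuron computes: a weighted sum of the inputs plus a bias, squashed by a sigmoid into a value between 0 and 1:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.5, -2.0])  # hypothetical weights for x1 (latitude) and x2 (longitude)
b = 0.3                    # hypothetical bias

def neuron(x):
    # outputs > 0.5 ("Manhattan") exactly when w[0]*x1 + w[1]*x2 + b > 0,
    # i.e. when the point lies on one side of a straight line
    return sigmoid(np.dot(w, x) + b)

print(neuron(np.array([0.5, -0.2])))  # a standardized (lat, lng) input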
In [ ]:
# plot a predicted map of Manhattan
def plot_predicted_map():
is_mt_pred = dnnc.predict(latlng_std, as_iterable=False) # an array of prediction results
plt.scatter(lng[is_mt_pred == 1], lat[is_mt_pred == 1], c='b')
plt.scatter(lng[is_mt_pred == 0], lat[is_mt_pred == 0], c='y')
plt.show()
# print the accuracy of the neural network
def print_accuracy():
accuracy = dnnc.evaluate(x=latlng_test, y=is_mt_test)["accuracy"]
print('Accuracy: {:.2%}'.format(accuracy))
# train the model for just 1 step and print the accuracy
dnnc.fit(x=latlng_train, y=is_mt_train, steps=1)
plot_predicted_map()
print_accuracy()
In the first method, plot_predicted_map(), we call the predict() method of the DNNClassifier class to get an array of prediction results (10,000 rows) like [1 0 0 1 ... 0 0 1 0], where 1 means that the neural network believes the geolocation is in Manhattan, and 0 means it does not. By using this array as an indexer for selecting the lat and lng pairs in each class, the method plots the geolocations predicted as Manhattan as blue dots and the others as yellow dots.

In the second method, print_accuracy(), we call the evaluate() method of the DNNClassifier class to calculate the accuracy of the predictions against the test data latlng_test and is_mt_test, and print it.

After defining these two methods, we call the fit() method of the DNNClassifier class to train the model for just one step. A step in the fit() method moves the weights and bias of the neural network only a little in the direction that reduces the network's error. However, it usually takes thousands of steps for a neural network to find the best weights and bias. So, what you are effectively seeing here is that the neural network in its initial state (before training) achieves a very low accuracy and cannot classify the Manhattan locations properly.
Finally, let's actually train the neural network! This time, we will train the network by calling the fit() method for 500 steps with the training data latlng_train and is_mt_train. Every 100 steps, we will call plot_predicted_map() and print_accuracy() to show the current accuracy of the network. Run the cell below and wait a while until the message "Training Finished." is printed. You will see the network repeatedly move the weights and bias in small steps to minimize the error and find the best position for the line that classifies geolocations in Manhattan. The final accuracy should be as high as 97%.
In [ ]:
steps = 100
for i in range(1, 6):
    dnnc.fit(x=latlng_train, y=is_mt_train, steps=steps)
    plot_predicted_map()
    print('Steps: ' + str(i * steps))
    print_accuracy()
print('\nTraining Finished.')
You just saw that the network can only draw a straight line on the map to classify whether a location is in Manhattan or not. This is so-called linear classification, and it is the limitation of a single-layer neural network: you can only achieve around 97% accuracy, because a straight line (a linear classifier) can't split the geolocation points between Manhattan and Brooklyn, which requires a curved boundary.
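To make "straight line" concrete: a single neuron predicts Manhattan exactly when w1*x1 + w2*x2 + b > 0, so the boundary where the prediction flips is the line w1*x1 + w2*x2 + b = 0. A minimal sketch with made-up weights (the values learned in training would differ):

import numpy as np
import matplotlib.pyplot as plt

w1, w2, b = 1.5, -2.0, 0.3       # hypothetical weights and bias
blat = np.linspace(-3, 3, 100)   # standardized latitude range
blng = -(w1 * blat + b) / w2     # solve w1*lat + w2*lng + b = 0 for lng
plt.plot(blng, blat)             # the decision boundary, longitude on the x axis
plt.show()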
We must go deeper. Let's define a deep neural network (DNN). Run the cell below to define a new DNNClassifier.
In [ ]:
dnnc = tf.contrib.learn.DNNClassifier(
feature_columns=feature_columns,
hidden_units=[20, 20, 20, 20],
n_classes=2)
dnnc
The only difference from the last DNNClassifier definition is the hidden_units parameter, which now defines 4 hidden layers with 20 neurons each. As the network has a total of 5 layers, we're now working with a deep neural network ("deep" means the network has more than 2 layers).
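For intuition, here is a minimal NumPy sketch (not part of the lab) of the forward pass through those hidden layers, using random placeholder weights and the ReLU activation that DNNClassifier applies by default:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

h = np.random.randn(2)                   # a standardized (lat, lng) input
for _ in range(4):                       # 4 hidden layers of 20 neurons each
    W = np.random.randn(20, h.shape[0])  # placeholder weight matrix
    bias = np.zeros(20)                  # placeholder biases
    h = relu(W.dot(h) + bias)            # each layer: linear transform + ReLU
print(h.shape)                           # (20,): output of the last hidden layer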
Let's see how the deep neural network works. Run the cell below and wait for a couple of minutes until it finishes training.
In [ ]:
steps = 30
for i in range(1, 6):
    dnnc.fit(x=latlng_train, y=is_mt_train, steps=steps)
    plot_predicted_map()
    print('Steps: ' + str(i * steps))
    print_accuracy()
print('\nTraining Finished.')
You just saw that a DNN can classify whether a location is in Manhattan or not at around 99.9% accuracy, with a curved boundary that fits between Manhattan and Brooklyn. In the next section, we will learn how a DNN is able to recognize and extract the complex patterns in the training dataset by using the power of its hidden layers.
In this section, we have learned the following concepts:

- extracting training data from BigQuery public datasets with Datalab
- basic array operations with NumPy
- feature scaling (standardization) with scikit-learn
- splitting data into a training set and a test set
- training a single neuron (a linear classifier) with the TensorFlow high level API
- training a deep neural network to learn a non-linear (curved) classification boundary
To learn more about deep neural networks, please proceed with 3. Why deep neural network can get smarter?.